Take-home_Ex03

Author

LIU YAN

Published

June 17, 2023

1. Overview

FishEye International, a non-profit organization dedicated to combatting illegal, unreported, and unregulated (IUU) fishing, has been granted access to an international finance corporation’s comprehensive database on fishing-related companies. Based on prior investigations, FishEye has observed a strong correlation between companies exhibiting anomalous structures and their involvement in IUU activities or other suspicious practices. To facilitate their efforts, FishEye has transformed the database into a knowledge graph, encompassing extensive information about the companies, owners, workers, and financial status.


The objective of this exercise is to use visual analytics to identify anomalies in the business groups present in the knowledge graph.

This exercise is based on Mini-Challenge 3, Task 1.

2. Data Preparation

2.1 Install R Packages and Import Dataset

The code chunk below will be used to install and load the necessary R packages to meet the data preparation, data wrangling, data analysis and visualisation needs.

pacman::p_load(jsonlite, tidygraph, ggraph, 
               visNetwork, graphlayouts, ggforce, 
               skimr, tidytext, tidyverse,igraph,grid,gridExtra,DT,RColorBrewer)

jsonlite: A simple and robust JSON parser and generator for R.

tidygraph: this package provides a tidy API for graph/network manipulation.

ggraph: an extension of ggplot2 for visualising graphs and networks, built on the grammar of graphics.

visNetwork: an R package for network visualization, using vis.js javascript library.

graphlayouts: Several new layout algorithms to visualize networks are provided which are not part of ‘igraph’.

ggforce: a collection of mainly new stats and geoms, together with facilities for composing specialised plots.

skimr: is designed to provide summary statistics about variables in data frames, tibbles, data tables and vectors.

tidytext: make many text mining tasks easier, more effective, and consistent with tools already in wide use.

tidyverse: A collection of core packages designed for data science, used extensively for data preparation and wrangling.

igraph: a fast, open-source library for the analysis of graphs or networks.

grid: provides a set of functions and classes that represent graphical objects.

gridExtra: provides a number of user-level functions to work with “grid” graphics, notably to arrange multiple grid-based plots on a page, and draw tables.

DT: provides an R interface to the JavaScript library DataTables.

RColorBrewer: provides color schemes for maps (and other graphics) designed by Cynthia Brewer.

2.2 Data Introduction

In the code chunk below, fromJSON() of jsonlite package is used to import MC3.json into R environment.

mc3_data <- fromJSON("data/MC3.json")

The output is called mc3_data. It is a large list object. The knowledge graph (KG) comprises 27,622 nodes and 24,038 edges, forming 7,794 connected components. The KG is an undirected multi-graph; at the top level, the JSON is a dictionary that holds the graph-level properties together with the nodes and links.

Node Attributes:

type – Type of node as defined above.

country – Country associated with the entity. This can be a full country or a two-letter country code.

product_services – Description of product services that the “id” node does.

revenue_omu – Operating revenue of the “id” node in Oceanus Monetary Units.

id – Identifier of the node, which is also the name of the entity.

role – A subset of the node “type”; not present for every node.

dataset – Always “MC3”.

Edge Attributes:

type – Type of the edge as defined above.

source – ID of the source node.

target – ID of the target node.

dataset – Always “MC3”.

role – A subset of the edge “type”; not present for every edge.
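To make the structure described above concrete, the sketch below parses a tiny hand-written JSON string with a similar top-level layout. All IDs and values are invented for illustration and are not taken from MC3.json; the exact set of graph-level keys in the real file is an assumption.

```r
library(jsonlite)

# Hypothetical miniature of the MC3 layout: graph-level properties plus
# "nodes" and "links" arrays. Every value here is made up.
toy_json <- '{
  "directed": false,
  "multigraph": true,
  "nodes": [
    {"id": "Co A", "type": "Company", "country": "ZH"},
    {"id": "Owner X", "type": "Beneficial Owner"}
  ],
  "links": [
    {"source": "Co A", "target": "Owner X", "type": "Beneficial Owner"}
  ]
}'

toy_data <- fromJSON(toy_json)

names(toy_data)      # top-level keys of the dictionary
str(toy_data$nodes)  # the nodes array is simplified into a data.frame
str(toy_data$links)  # the links array likewise
```

Attributes missing from a node (here, the country of "Owner X") come back as NA, which is one reason the later steps coerce and clean each column explicitly.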

2.3 Initial Data Exploration

In this section, we explore the data to better understand the dataset.

2.3.1 Extracting Edges & Nodes

The code chunk below will be used to extract the links data.frame of mc3_data and save it as a tibble data.frame called mc3_edges.

# Convert links data to tibble format
mc3_edges <- as_tibble(mc3_data$links) %>%  
  distinct() %>% # Remove duplicate edges
  mutate(source = as.character(source),
         target = as.character(target),
         type = as.character(type)) %>%
  # Group edges by source, target, and type
  group_by(source, target, type) %>%
  # Calculate the number of edges for each source/target/type combination
    summarise(weights = n()) %>%
  filter(source!=target) %>%
  ungroup()
Note

distinct() is used to ensure that there will be no duplicated records.

mutate() and as.character() are used to convert the field data type from list to character.

group_by() and summarise() are used to count the number of unique links.

filter(source != target) removes records whose source and target are the same node (self-loops).
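The effect of these steps can be checked on a small example. The sketch below applies the same pipeline to a hypothetical five-row link table (names invented for illustration) containing one exact duplicate and one self-loop.

```r
library(dplyr)

# Hypothetical links: rows 1 and 2 are exact duplicates, row 4 is a self-loop.
toy_links <- tibble(
  source = c("A", "A", "A", "B", "C"),
  target = c("x", "x", "y", "B", "z"),
  type   = c("Beneficial Owner", "Beneficial Owner", "Company Contacts",
             "Beneficial Owner", "Company Contacts")
)

toy_edges <- toy_links %>%
  distinct() %>%                              # drops the duplicated A-x row
  group_by(source, target, type) %>%
  summarise(weights = n(), .groups = "drop") %>%
  filter(source != target)                    # drops the B-B self-loop

print(toy_edges)
```

Because distinct() runs before the grouping, a table that only has the source, target and type columns will always produce weights of 1. Weights above 1 can only arise when the rows differ in some other column.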

DT::datatable(mc3_edges)

The code chunk below will be used to extract the nodes data.frame of mc3_data and save it as a tibble data.frame called mc3_nodes.

# Convert nodes data to tibble format
mc3_nodes <- as_tibble(mc3_data$nodes) %>%
  mutate(country = as.character(country),
         id = as.character(id),
         product_services = as.character(product_services),
         revenue_omu = as.numeric(as.character(revenue_omu)),
         type = as.character(type)) %>%
  # Select specific columns for the resulting tibble
  select(id, country, type, revenue_omu, product_services)
Note

mutate() and as.character() are used to convert the field data type from list to character.

To convert revenue_omu from list data type to numeric data type, we need to convert the values into character first by using as.character(). Then, as.numeric() will be used to convert them into numeric data type.

select() is used to re-organise the order of the fields.
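The two-step conversion described above can be tried on a small list. This is a minimal sketch with invented values; the real revenue_omu column may contain other shapes of entries.

```r
# Hypothetical list column resembling revenue_omu after import:
# a number, an empty entry, and a number stored as text.
revenue_list <- list(12345.67, character(0), "890")

# as.character() turns each list element into a string; an empty element
# is deparsed into the literal text "character(0)".
revenue_chr <- as.character(revenue_list)

# as.numeric() then parses the strings; anything unparseable becomes NA
# (the coercion warning is suppressed here).
revenue_num <- suppressWarnings(as.numeric(revenue_chr))

print(revenue_chr)
print(revenue_num)
```

The same deparsing applies to product_services, which is presumably why the words “character” and “0” are later added to the custom stop-word list.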

DT::datatable(mc3_nodes)

The table shows that some observations have product services unrelated to the fishing industry. For example, the node with id “Jones LLC” offers product services related to automobiles rather than fishing.

2.3.2 Text Sensing with tidytext

Section 2.3.1 showed that many nodes have a product_services value unrelated to the fishing industry. To keep the subsequent analyses focused, these irrelevant nodes need to be filtered out. We therefore use text sensing techniques and the tidytext package to analyze the keywords present in the product_services attribute of the nodes.

In this section, tokenisation will be used in the process of breaking up a given text into units called tokens. Tokens can be individual words, phrases or even whole sentences. In the process of tokenisation, some characters like punctuation marks may be discarded. The tokens usually become the input for the processes like parsing and text mining.

In the code chunk below, unnest_tokens() of tidytext is used to split the text in the product_services field into words.

token_nodes <- mc3_nodes %>%
  unnest_tokens(word, 
                product_services)
Note

By default, punctuation has been stripped.

By default, unnest_tokens() converts the tokens to lowercase, which makes them easier to compare or combine with other datasets. (Use the to_lower = FALSE argument to turn off this behavior).

The two basic arguments to unnest_tokens() used here are column names. First we have the output column name that will be created as the text is unnested into it (word, in this case), and then the input column that the text comes from (product_services, in this case).

The tidytext package provides a data set called stop_words that helps us remove stop words.

# New words to be added
new_words <- data.frame(word = c("unknown", "character","0", "products","equipment","services","accessories","related"))

# Combine existing stop words with new words
updated_stop_words <- bind_rows(stop_words, new_words)

stopwords_removed <- token_nodes %>% 
  anti_join(updated_stop_words)
Note

The new stop words are appended to stop_words with bind_rows().

Then anti_join() of dplyr package is used to remove all stop words from the analysis.
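As a quick check of the tokenise-then-filter idea, the sketch below runs the same two steps on two invented product descriptions. The toy table and the single extra stop word are assumptions for illustration only.

```r
library(dplyr)
library(tidytext)

# Two hypothetical nodes with free-text product descriptions.
toy_nodes <- tibble(
  id = c("n1", "n2"),
  product_services = c("Fish, seafood and salmon products",
                       "The automobile")
)

# Extend the built-in stop_words data set with one custom word.
custom_stop_words <- bind_rows(stop_words, tibble(word = "products"))

toy_tokens <- toy_nodes %>%
  unnest_tokens(word, product_services) %>%   # lowercases, strips punctuation
  anti_join(custom_stop_words, by = "word")   # drops "and", "the", "products"

print(toy_tokens)
```

Passing by = "word" explicitly avoids the join-column message that anti_join() otherwise prints.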

We can visualise the words extracted by using the code chunk below.

stopwords_removed %>% 
  count(word, sort = TRUE) %>%
  top_n(15, n) %>%
  mutate(word = reorder(word, n)) %>%
  ggplot(aes(x = word, y = n)) +
  geom_col() +
  coord_flip() +
  # x holds the words and y the counts; coord_flip() only swaps how they are drawn
  labs(x = "Unique words",
       y = "Count",
       title = "Count of unique words found in product_services field")

Note

count() counts the frequency of each unique word in the ‘word’ column, sorting the result in descending order.

top_n(15) selects the top 15 words with the highest frequency.

mutate() reorders the ‘word’ column based on the frequency (‘n’) of each word.

coord_flip() flips the coordinates, resulting in a horizontal bar plot.

Based on the bar chart above, three fishing-specific keywords, “fish”, “seafood”, and “salmon”, appear more frequently than most other words. We therefore use these three keywords as filters to select nodes and edges for further analysis in subsequent steps.

2.4 Data Wrangling

2.4.1 Data Preparation_Fishing Edges & Nodes Filter

In this section, we will create an analysis subset of nodes and edges specifically related to the terms “fish”, “seafood”, and “salmon”. Note that the filter pattern used in the code below is "fishing|seafood|salmon", so the first keyword is matched in its “fishing” form.

During our investigation, we found that only the “source” IDs in the mc3_edges dataset appear in the mc3_nodes dataset. Therefore, when filtering for fishing-related edges, we check whether the “source” node’s product_services attribute contains the keywords of interest. This filtered subset of edges will be referred to as mc3_edges_fishing.

To extract the nodes, we take both the “source” and “target” node IDs from mc3_edges_fishing. The “source” nodes are kept only if their product_services attribute contains the specified keywords. The “target” nodes have no record in mc3_nodes, so they inherit their type attribute from the edge in mc3_edges_fishing.

By following this approach, we will establish a refined analysis dataset consisting of relevant nodes and edges, enabling further examination of the fishing domain.

# Filter edges based on product_service criteria
mc3_edges_fishing <- mc3_edges %>%
  filter(
    source %in% mc3_nodes$id[grep("fishing|seafood|salmon", mc3_nodes$product_services, ignore.case = TRUE)]) %>%
  distinct()


# Extract relevant node IDs from mc3_edges_fishing
related_node_ids <- unique(c(mc3_edges_fishing$source, mc3_edges_fishing$target))

# Filter nodes based on related_node_ids and product_service criteria
mc3_nodes_source <- mc3_nodes %>%
  filter(
    id %in% related_node_ids &
    grepl("fishing|seafood|salmon", product_services, ignore.case = TRUE)
  )

# Build the target nodes; each inherits its type from the edge it appears in
mc3_nodes_target <- mc3_edges_fishing %>%
  select(target, type) %>%
  rename(id = target)

# Create an empty dataframe containing all variables
empty_df <- data.frame(id = character(),country=character(), type = character(), revenue_omu=numeric(), product_services=character(),stringsAsFactors = FALSE)

# Fill in missing columns to match the number of columns 
mc3_nodes_source <- bind_rows(mc3_nodes_source, empty_df)
mc3_nodes_target <- bind_rows(empty_df, mc3_nodes_target)

# Merge mc3_nodes_source and mc3_nodes_target
mc3_nodes_fishing <- bind_rows(mc3_nodes_source, mc3_nodes_target)

# Aggregate mc3_nodes_fishing so each id/country/type combination appears once
mc3_nodes_fishing <- mc3_nodes_fishing %>%
  group_by(id, country, type)%>%
  summarise(revenue_omu = sum(revenue_omu))%>%
  ungroup()
Note

grep("pattern", x, ignore.case = TRUE) searches for the specified “pattern” in the character vector “x” and, by default, returns the indices of the matching elements. grepl() returns a logical vector instead.

unique() returns the unique elements of the vector “x” by removing any duplicates.

distinct() removes duplicate rows from a data frame, keeping only the unique rows.

bind_rows() combines multiple data frames or tibbles by stacking them vertically to create a new data frame.
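The difference between grep() and grepl(), and between the “fish” and “fishing” patterns, can be seen on a few invented strings. This is only an illustration; the strings are not taken from the dataset.

```r
# Hypothetical product descriptions, including a missing value.
product_services <- c("Fishing gear and nets",
                      "Fresh fish fillets",
                      "Canned SALMON",
                      "Automobile parts",
                      NA)

# grep() returns the positions of the matches ...
idx_fishing <- grep("fishing|seafood|salmon", product_services,
                    ignore.case = TRUE)

# ... while grepl() returns one TRUE/FALSE per element (NA counts as FALSE).
is_fishing <- grepl("fishing|seafood|salmon", product_services,
                    ignore.case = TRUE)

# The shorter stem "fish" also matches "Fresh fish fillets".
is_fish <- grepl("fish|seafood|salmon", product_services,
                 ignore.case = TRUE)

print(idx_fishing)
print(is_fishing)
print(is_fish)
```

Using the stem “fish” would therefore select a broader set of nodes than the “fishing” pattern used in this section.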

2.4.2 Data Preparation_Closeness & Degree Centrality

To analyze the network connection structure, we construct a tbl_graph object by utilizing the newly created mc3_nodes_fishing and mc3_edges_fishing datasets. This tbl_graph object represents the network structure and allows us to perform various network analysis tasks and explorations.

mc3_graph_fishing <- tbl_graph(nodes = mc3_nodes_fishing,
                       edges = mc3_edges_fishing,
                       directed = FALSE) %>%
  mutate(betweenness_centrality = centrality_betweenness(),
         closeness_centrality = centrality_closeness())

Degree centrality is a measure that quantifies the number of edges connected to a specific node in a network. It serves as an indicator of the node’s influence or centrality within the network, as nodes with a higher degree centrality exhibit a larger number of connections. In the context of the mc3_graph_fishing network, the degree centrality represents the number of companies or people that are connected to a particular node.

In order to capture this information, we calculate the degree centrality for each node in the mc3_graph_fishing network. The resulting degree centrality values are then stored as a variable within the mc3_nodes_fishing dataset.
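A minimal sketch of degree centrality on an invented four-edge network follows; the node names are hypothetical and chosen only to mirror the owner/company setting.

```r
library(igraph)

# Hypothetical edge list: one owner linked to three companies,
# and a second owner sharing one of them.
toy_edges <- data.frame(
  from = c("Owner X", "Owner X", "Owner X", "Owner Y"),
  to   = c("Co A", "Co B", "Co C", "Co C")
)

toy_graph <- graph_from_data_frame(toy_edges, directed = FALSE)

# degree() returns a named vector: number of edges attached to each node.
toy_degree <- degree(toy_graph)
print(sort(toy_degree, decreasing = TRUE))
```

In a multi-graph, parallel edges between the same pair of nodes each add to the degree, so the value counts connections rather than distinct neighbours.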

# Calculate the node degree centrality for mc3_graph_fishing 
node_degrees <- degree(mc3_graph_fishing)

# Create a new column "degree" for mc3_nodes_fishing
mc3_nodes_fishing <- mc3_nodes_fishing %>%
  mutate(degree = node_degrees)

# Plot Bar chart
ggplot(mc3_nodes_fishing, aes(x = degree)) +
  geom_bar(fill = "#808de8", color = "black") +
  labs(title = "Bar Chart of Degree Centrality",
       x = "Degree Centrality",
       y = "Frequency")

Closeness centrality is a measure that quantifies how close a node is to all other nodes in a network. In the context of the mc3_graph_fishing network, we compute the closeness centrality for each node. It is based on the shortest path lengths from a given node to the other nodes it can reach: nodes with a shorter average shortest path length receive a higher closeness centrality score.

The computed closeness centrality values are then stored as a variable within the mc3_nodes_fishing dataset.
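To illustrate the definition, the sketch below computes closeness by hand on an invented three-node path A-B-C and compares it with igraph's closeness(). It assumes the default, unnormalised form, which is the reciprocal of the sum of shortest-path distances.

```r
library(igraph)

# Hypothetical path graph: A - B - C
path_graph <- graph_from_data_frame(
  data.frame(from = c("A", "B"), to = c("B", "C")),
  directed = FALSE
)

# Matrix of shortest-path lengths between every pair of nodes.
d <- distances(path_graph)

# Unnormalised closeness: 1 / (sum of distances to all other nodes).
manual_closeness <- 1 / rowSums(d)

# igraph's built-in measure with its default settings.
builtin_closeness <- closeness(path_graph)

print(manual_closeness)
print(builtin_closeness)
```

The middle node B has the highest score because it is one step from both ends. In a graph with many small disconnected components, such as the one analysed here, scores may depend on how the igraph version handles unreachable nodes, so values are best compared within a component.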

# Calculate closeness centrality for mc3_graph_fishing
closeness_values <- mc3_graph_fishing %>%
  pull(closeness_centrality)

# Create a new column "closeness_centrality" for mc3_nodes_fishing
mc3_nodes_fishing <- mc3_nodes_fishing %>%
  mutate(closeness_centrality = closeness_values)

# Plot histogram
ggplot(mc3_nodes_fishing, aes(x = closeness_centrality)) +
  geom_histogram(binwidth = 0.02, fill = "#808de8", color = "black") +
  labs(title = "Histogram of Closeness Centrality",
       x = "Closeness Centrality",
       y = "Frequency")

2.4.3 Data Preparation_Connection Object Count

Given that the network consists of three types of nodes, “Company”, “Company Contacts”, and “Beneficial Owner”, we can calculate the following counts for each node:

• The number of “Company” nodes connected to it.

• The number of “Company Contacts” nodes connected to it.

• The number of “Beneficial Owner” nodes connected to it.

# Initialize vectors to store the counts (one entry per node)
n_nodes <- vcount(mc3_graph_fishing)
company_qty <- vector("integer", length = n_nodes)
contacts_qty <- vector("integer", length = n_nodes)
owner_qty <- vector("integer", length = n_nodes)

# Iterate over each node; seq_len() is safe even if the graph is empty
for (i in seq_len(n_nodes)) {
  # Types of all nodes adjacent to node i
  neighbor_types <- neighbors(mc3_graph_fishing, i, mode = "all")$type

  # Count the number of each type of node among the neighbors
  company_qty[i] <- sum(neighbor_types == "Company")
  contacts_qty[i] <- sum(neighbor_types == "Company Contacts")
  owner_qty[i] <- sum(neighbor_types == "Beneficial Owner")
}
# Add the counts as new columns to the node data frame
mc3_nodes_fishing$company_qty <- company_qty
mc3_nodes_fishing$contacts_qty <- contacts_qty
mc3_nodes_fishing$owner_qty <- owner_qty
Note

vector() creates a vector of the given type and length; an integer vector is initialised with zeros.

V() is a function used to access the nodes (vertices) of a graph. In this code, V(mc3_graph_fishing) retrieves all the nodes in the mc3_graph_fishing graph.

sum() Calculates the sum of values.

To facilitate this analysis, we create three new variables: “company_qty,” “contacts_qty,” and “owner_qty” to store the respective counts of connected nodes for each node in the network. These variables will provide valuable insights into the connectivity patterns between different types of nodes and enable further examination of the network’s structure and relationships.
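The counting logic can be verified on a small invented graph. The helper count_neighbor_type() below is a hypothetical name introduced for this sketch, not a function from igraph or from the analysis above.

```r
library(igraph)

# Hypothetical nodes and edges mirroring the three node types.
toy_nodes <- data.frame(
  name = c("Co A", "Co B", "Owner X", "Contact Y"),
  type = c("Company", "Company", "Beneficial Owner", "Company Contacts")
)
toy_edges <- data.frame(
  from = c("Co A", "Co A", "Co B"),
  to   = c("Owner X", "Contact Y", "Owner X")
)
toy_graph <- graph_from_data_frame(toy_edges, directed = FALSE,
                                   vertices = toy_nodes)

# For every node, count the neighbours whose type equals node_type.
count_neighbor_type <- function(graph, node_type) {
  counts <- vapply(seq_len(vcount(graph)), function(i) {
    sum(neighbors(graph, i)$type == node_type)
  }, integer(1))
  names(counts) <- V(graph)$name
  counts
}

company_qty <- count_neighbor_type(toy_graph, "Company")
owner_qty   <- count_neighbor_type(toy_graph, "Beneficial Owner")

print(company_qty)
print(owner_qty)
```

Here "Owner X" is linked to two companies, so its company_qty is 2, while each company has an owner_qty of 1.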

# Create each plot
plot1 <- ggplot(data = mc3_nodes_fishing, aes(x = company_qty)) + geom_bar(fill="#808de8",color = "black")+
  ggtitle("Number of Companies Connected to Each Node")+
  theme(plot.title = element_text(size = 7))

plot2 <- ggplot(data = mc3_nodes_fishing, aes(x = contacts_qty)) + geom_bar(fill="#808de8", color = "black")+
  ggtitle("Number of Company Contacts Connected to Each Node")+
  theme(plot.title = element_text(size = 7))

plot3 <- ggplot(data = mc3_nodes_fishing, aes(x = owner_qty)) + geom_bar(fill="#808de8", color = "black")+
  ggtitle("Number of Beneficial Owners Connected to Each Node")+
  theme(plot.title = element_text(size = 10))

# Combine the plots with modified layout
combined_plot <- grid.arrange(
  arrangeGrob(plot1, plot2, ncol = 2),
  plot3,
  heights = c(2, 1)
)

3. Data Visualisation

In this section, we will generate the network representation of nodes and edges related to the fishing industry. Subsequently, we will conduct an analysis of the network, with a specific focus on highlighting the anomalies within the network.

3.2 Network for Anomalies Nodes

Based on the findings from section 2.4.3, which examined the count of connection objects in the data, we observed that the majority of nodes in the network were associated with either zero or one company connection. A smaller number of nodes were connected to two companies, and only a very limited number of nodes were connected to three companies.

Therefore, we can infer that nodes connected to three companies, especially nodes of type “Beneficial Owner”, are more likely to be anomalies within the network.

# Filter nodes with type "Beneficial Owner"
beneficial_owner_nodes <- mc3_nodes_fishing[mc3_nodes_fishing$type == "Beneficial Owner", ]

# Create bar chart for "company_qty"
plot <- ggplot(beneficial_owner_nodes, aes(x = company_qty)) +
  geom_bar(fill = "#808de8", color = "black") +
  ggtitle("Company Quantity for Beneficial Owner Nodes")

# Print the plot
print(plot)

By referring to the data table for mc3_nodes_fishing, it becomes evident that there is only one node, identified as “James Brown” that exhibits connections with three different companies. This unique characteristic of James Brown’s node suggests a higher likelihood of it being an anomaly within the network.
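Rather than reading the table by eye, the same check can be expressed as a filter. The sketch below uses an invented node table and a hypothetical threshold of three companies; the column names follow those created in section 2.4.3.

```r
# Hypothetical node table with the columns used in this section.
toy_nodes <- data.frame(
  id          = c("Owner X", "Owner Y", "Co A", "Owner Z"),
  type        = c("Beneficial Owner", "Beneficial Owner",
                  "Company", "Beneficial Owner"),
  company_qty = c(3L, 1L, 0L, 2L)
)

# Assumed cut-off: owners linked to at least this many companies.
threshold <- 3L

candidates <- subset(toy_nodes,
                     type == "Beneficial Owner" & company_qty >= threshold)

print(candidates$id)
```

Applied to mc3_nodes_fishing, a filter of this form should return the single node described above, and it makes the threshold easy to vary.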

library(DT)

# Display mc3_nodes_fishing as an interactive, scrollable table
datatable(mc3_nodes_fishing, options = list(scrollY = "500px"))
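
The nodes of the network associated with “James Brown” are then selected manually by row index in mc3_nodes_fishing and drawn as an interactive visNetwork graph.
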
# Create a new column 'Index' with row numbers
mc3_nodes_fishing$Index <- seq_len(nrow(mc3_nodes_fishing))

# Select the nodes based on the given indices
selected_indices <- c(697, 73, 670, 426, 1623, 322, 58, 977, 327, 797, 302, 657, 727, 641, 918, 1230, 1712, 627, 217, 814, 73, 1798, 1366, 61, 897, 1217, 222)
selected_nodes <- mc3_nodes_fishing[mc3_nodes_fishing$Index %in% selected_indices, ]
selected_id <- selected_nodes$id

# Find the edges where both source and target nodes are from selected nodes' IDs
selected_edges <- mc3_edges_fishing[mc3_edges_fishing$source %in% selected_id & mc3_edges_fishing$target %in% selected_id, ]
colnames(selected_edges)[1:2] <- c("from", "to")

# Assuming 'selected_nodes' and 'selected_edges' are your existing data frames

# Add a new column 'size' based on the 'closeness_centrality' column, scaled by a factor
selected_nodes$size <- selected_nodes$closeness_centrality * 1000

# Create a color palette based on the unique values in the 'type' column
palette <- brewer.pal(n = length(unique(selected_nodes$type)), name = "Set3")

# Assign colors based on the 'type' column
selected_nodes$color <- palette[as.factor(selected_nodes$type)]


# Create the visNetwork plot
g <- visNetwork(nodes = selected_nodes, edges = selected_edges, height = "500px", width = "100%") %>%
  visIgraphLayout(layout = "layout_with_fr") %>%
  visNodes(color = ~color, size = ~size) %>%
  visEdges(color = list(highlight = "lightgray")) %>%
  visOptions(selectedBy = 'type',
             highlightNearest = list(enabled = TRUE,
                                     degree = 1,
                                     hover = TRUE,
                                     labelOnly = TRUE),
             nodesIdSelection = TRUE) %>%
  visLegend() %>%
  visInteraction(navigationButtons = TRUE) %>%
  visLayout(randomSeed = 123)

g

The connected component containing “James Brown” (node index 697) is then extracted with igraph.

# Convert the network object to igraph object
g <- as.igraph(mc3_graph_fishing)

# IMPORTANT ! set vertex names otherwise when you split in sub-graphs you won't be able to recognize them
g <- set.vertex.attribute(g,'name',index=V(g),as.character(1:vcount(g)))

# decompose the graph
sub.graphs  <- decompose.graph(g)

# search for the sub-graph index containing node 697
sub.graph.indexes <- which(sapply(sub.graphs,function(g) any(V(g)$name %in% c('697'))))

# merge the desired subgraphs
merged <- do.call(graph.union,sub.graphs[sub.graph.indexes])

plot(merged)

In the visualization above, the “James Brown” node shows a distinct pattern: it is connected to three separate company nodes, which no other node in the filtered network does.

4. Future Work

  1. In section 3.2, “Network for Anomalies Nodes,” we used igraph to extract the network associated with the node “James Brown”. However, we encountered a limitation: the id labels could not be displayed next to the nodes. Consequently, we manually entered the row indices of the nodes in that network to build the nodes and edges for the visNetwork plot. Future work should either automate the creation of these nodes and edges or use igraph to draw the network with the id labels next to each node.

  2. For the anomalous nodes identified in the preceding sections, additional evidence is needed to substantiate the involvement of the associated companies in illegal fishing activities.

  3. A visual analytics process should be developed to identify similar businesses and measure the similarity among the grouped businesses from the previous question. This entails leveraging visual analytics techniques and tools to explore and analyze the shared characteristics, patterns, or behaviors among the businesses in the identified groups.
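As one possible direction for the first item, the sketch below selects the connected component that contains a named node and keeps the original ids as vertex names, so they can be used as labels. The graph and all names are invented, and this is an assumption about how the step could be automated rather than the method used in section 3.2.

```r
library(igraph)

# Hypothetical network with two components.
toy_edges <- data.frame(
  from = c("Owner X", "Owner X", "Owner X", "Owner P", "Owner Q"),
  to   = c("Co A", "Co B", "Co C", "Co D", "Co D")
)
toy_graph <- graph_from_data_frame(toy_edges, directed = FALSE)

# Component id of every node, named by vertex name.
membership <- components(toy_graph)$membership

# Keep only the nodes in the same component as the node of interest.
target_component <- membership[["Owner X"]]
keep <- which(membership == target_component)
sub_graph <- induced_subgraph(toy_graph, keep)

print(V(sub_graph)$name)
```

Because the vertex names are preserved, a call such as plot(sub_graph, vertex.label = V(sub_graph)$name) would draw the ids next to the nodes, and as_data_frame(sub_graph, what = "edges") could feed visNetwork without hand-picked indices.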

5. References

  1. Kam Tin Seong, Take-home Exercise 3: Kick-starter, 2023.
  2. Katherine Ognyanova, Network Analysis and Visualization with R and igraph, NetSciX 2016 School of Code Workshop, 2016.
  3. Fatima Rubio, When is the Closeness Centrality Algorithm best applied, Graphable, 2023.
  4. Alex Derr, Network Centrality: Understanding Degree, Closeness & Betweenness Centrality, VisibleNetworkLabs, 2021.